Detecting Handwritten text in Documents

Problem Statement

We want to detect handwritten text in a scanned or PDF document. There can be a number of reasons for doing so, such as:

  • to identify if the document has been signed
  • to process handwritten text in the document in a different way
  • to mask the handwritten text

Take the following document image as an example. We want to detect the text highlighted by the red bounding boxes.

Each image is annotated in Pascal VOC annotation format using the Microsoft VoTT annotation tool. The directory structure of the annotated dataset looks like this:

data/Annotations_99
data/JPEGImages_99

There are 99 annotated images in the dataset. The images are in the JPEGImages_99 folder and the corresponding XML annotations are in Annotations_99.

An XML annotation file looks like

<annotation verified="yes">
    <folder>Annotation</folder>
    <filename>07653e58-24d1-4b3f-9b4a-76057efe5c09-1</filename>
    <path>C:\data\JPEGImages\07653e58-24d1-4b3f-9b4a-76057efe5c09-1.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>1700</width>
        <height>2200</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>signature</name>
        <pose>Unspecified</pose>
        <bndbox>
            <xmin>192</xmin>
            <ymin>1188</ymin>
            <xmax>738</xmax>
            <ymax>1320</ymax>
         </bndbox>
    </object>
    ...

In Pascal VOC annotation, there is a separate annotation file for each image. The data we are interested in from the XML file is:

  • image filename - 07653e58-24d1-4b3f-9b4a-76057efe5c09-1
  • object element for each annotation in the image
    • category/class of the marked annotation
    • bounding box coordinates of the top-left and bottom-right corners
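
As a quick illustration of pulling these fields out of one annotation, here is a minimal sketch that uses only the standard library. The XML string is a trimmed copy of the sample above, and parse_voc_annotation is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annotation shown above (one object only).
SAMPLE_XML = """
<annotation verified="yes">
    <filename>07653e58-24d1-4b3f-9b4a-76057efe5c09-1</filename>
    <size>
        <width>1700</width>
        <height>2200</height>
        <depth>3</depth>
    </size>
    <object>
        <name>signature</name>
        <bndbox>
            <xmin>192</xmin>
            <ymin>1188</ymin>
            <xmax>738</xmax>
            <ymax>1320</ymax>
        </bndbox>
    </object>
</annotation>
"""


def parse_voc_annotation(xml_text):
    """Return the filename, image size and labelled boxes from one VOC XML string."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.find("name").text,
            # (xmin, ymin) is the top-left corner, (xmax, ymax) the bottom-right corner.
            "bbox": [int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")],
        })
    return {
        "filename": root.find("filename").text,
        "width": int(root.find("size/width").text),
        "height": int(root.find("size/height").text),
        "objects": objects,
    }


parsed = parse_voc_annotation(SAMPLE_XML)
print(parsed["filename"], parsed["objects"])
```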

About Detectron2 Framework

We will use Detectron2, a PyTorch-based framework, because it is simple and easy to extend. It provides training, visualization, and prediction modules that handle most of the work; we can use them as is or extend them if required.

Simple steps to train a vision model in Detectron2

  1. Convert the dataset to the detectron2 format
  2. Register the dataset and its metadata, such as class labels
  3. Update the config with the registered dataset (DATASETS.{TRAIN,TEST}), model weights (MODEL.WEIGHTS), learning rate, number of output classes (MODEL.ROI_HEADS.NUM_CLASSES), and other training and test parameters
  4. Train the model using the DefaultTrainer class

Dataset Preparation (steps 1 & 2)

Detectron2 expects the dataset as a list[dict], so for training we have to convert our dataset into the following format.

[{'file_name': 'datasets/JPEGImages/1.jpg',
  'image_id': '1',
  'height': 3300,
  'width': 2550,
  'annotations': [{'category_id': 1,
    'bbox': [1050.1000264270613,
     457.33333333333337,
     1406.9139799154334,
     587.7450980392157],
    'bbox_mode': <BoxMode.XYXY_ABS: 0>},
   {'category_id': 1,
    'bbox': [1529.9097515856238,
     473.5098039215687,
     1617.167679704017,
     555.3921568627452],
    'bbox_mode': <BoxMode.XYXY_ABS: 0>}]}]
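
A quick sanity check on records of this shape can catch conversion mistakes early. The sketch below is plain Python and does not call detectron2: check_record is a hypothetical helper, and the integer 0 stands in for BoxMode.XYXY_ABS (the value shown in the listing above) so that the example runs without the library installed.

```python
XYXY_ABS = 0  # stand-in for BoxMode.XYXY_ABS, printed as value 0 in the listing above
REQUIRED_KEYS = ("file_name", "image_id", "height", "width", "annotations")


def check_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    width, height = record["width"], record["height"]
    for i, ann in enumerate(record["annotations"]):
        if ann["bbox_mode"] != XYXY_ABS:
            problems.append(f"annotation {i}: bbox_mode is not XYXY_ABS")
            continue
        xmin, ymin, xmax, ymax = ann["bbox"]
        # In XYXY mode the first corner must lie above and to the left of the second.
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            problems.append(f"annotation {i}: box {ann['bbox']} is empty or outside the image")
    return problems


good = {
    "file_name": "datasets/JPEGImages/1.jpg",
    "image_id": "1",
    "height": 3300,
    "width": 2550,
    "annotations": [
        {"category_id": 1, "bbox": [1050.1, 457.3, 1406.9, 587.7], "bbox_mode": XYXY_ABS},
    ],
}
# Same record with xmin and xmax swapped, a typical conversion mistake.
bad = dict(good, annotations=[
    {"category_id": 1, "bbox": [1406.9, 457.3, 1050.1, 587.7], "bbox_mode": XYXY_ABS},
])

print(check_record(good))
print(check_record(bad))
```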

Detectron2 wraps this list of dicts in a torch dataset and uses the default data loader and data sampler for training. We can register the list[dict] with detectron2 using the following code:

def get_dicts():
    # Build and return a list[dict] in the format shown above
    dicts = []
    ...
    return dicts

from detectron2.data import DatasetCatalog, MetadataCatalog
DatasetCatalog.register("my_dataset", get_dicts)

To register metadata related to the dataset, such as the mapping of categories to ids or the type of dataset, we set key-value pairs using:

MetadataCatalog.get("my_dataset").thing_classes = ["person", "dog"]

Choosing a Model and Initializing Configuration (step 3)

Detectron2 has many pretrained models available in its model zoo. For handwritten text detection, we will choose Faster RCNN with an FPN backbone.

We have to initialize the parameters and weights of the model we want to train.

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("<pretrained model config>")
cfg.MODEL.WEIGHTS = "<path to pretrained model weight>"

# custom config for training
cfg.DATASETS.TRAIN = ("<registered training dataset name>",)
cfg.SOLVER.MAX_ITER = 4000  # number of training iterations (an integer)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2  # number of classes (an integer)

All the model configuration is available in the cfg object. If we want to replicate the training later, we can save the cfg object and load it back to resume training.

Model Training (step 4)

We will use the DefaultTrainer for now. It is one of the simple modules that accept only minimal parameters and make assumptions about many things.

The DefaultTrainer module:

  • builds the model
  • builds the optimizer
  • builds the dataloader
  • loads the model weights, and
  • registers common hooks

from detectron2.engine import DefaultTrainer

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

Now we can train our object detection model using Detectron2. We will try the Faster RCNN R50 FPN model and see how it performs.

1. Prepare & Visualize the Dataset

To visualize the labeled dataset in detectron2, we need to convert the XML annotations into the detectron2 dataset format as explained above.

We will use the custom function register_pascal_voc(), which converts the dataset into detectron2 format and registers it with DatasetCatalog. It expects a directory structure like:

Annotations  
JPEGImages  
train.txt

train.txt and test.txt contain one filename (without extension) per row.
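
If the split files do not exist yet, they can be generated from the annotation folder. The following is a minimal sketch under stated assumptions: write_split_files is a hypothetical helper, and the 80/20 split and fixed seed are illustrative choices rather than the ones used for this dataset. The demo runs on a temporary directory with empty placeholder files.

```python
import os
import random
import tempfile


def write_split_files(dirname, test_fraction=0.2, seed=0):
    """Write train.txt and test.txt listing annotation file stems, one per line."""
    stems = sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(os.path.join(dirname, "Annotations"))
        if name.endswith(".xml")
    )
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(stems)
    n_test = int(len(stems) * test_fraction)
    splits = {"test": sorted(stems[:n_test]), "train": sorted(stems[n_test:])}
    for split, names in splits.items():
        with open(os.path.join(dirname, split + ".txt"), "w") as f:
            f.write("\n".join(names) + "\n")
    return splits


# Demo on a throwaway directory holding ten empty annotation files.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "Annotations"))
    for i in range(10):
        open(os.path.join(demo_dir, "Annotations", f"doc-{i}.xml"), "w").close()
    splits = write_split_files(demo_dir)
    with open(os.path.join(demo_dir, "train.txt")) as f:
        train_lines = f.read().split()

print(len(splits["train"]), len(splits["test"]))
```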

Visualizer Class

To draw the annotations on the images, we will use the Detectron2 Visualizer class, which takes the image in RGB format, the metadata (which holds the ordered label names), and a scale parameter.

  • Visualizer.draw_instance_predictions() draws the prediction results
  • Visualizer.draw_dataset_dict() draws the annotated dataset

%matplotlib inline
import numpy as np
import os 
import xml.etree.ElementTree as ET
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.structures import BoxMode

from fvcore.common.file_io import PathManager
import random
import cv2
from detectron2.utils.visualizer import Visualizer
from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
    

def load_voc_instances(dirname, split, CLASS_NAMES):
    """
    Load Pascal VOC detection annotations to Detectron2 format.
    Args:
        dirname: Contain "Annotations", "JPEGImages"
        split (str): one of "train", "test", "val", "trainval"
        CLASS_NAMES (list[str]): ordered class names; the list index becomes category_id
    """
    with PathManager.open(os.path.join(dirname, split+".txt")) as f:
        # np.str was removed in NumPy 1.24; the builtin str works on all versions
        fileids = np.loadtxt(f, dtype=str)

    dicts = []
    for fileid in fileids:
        anno_file = os.path.join(dirname, "Annotations", fileid + ".xml")
        jpeg_file = os.path.join(dirname, "JPEGImages", fileid + ".jpg")

        tree = ET.parse(anno_file)

        r = {
            "file_name": jpeg_file,
            "image_id": fileid,
            "height": int(tree.findall("./size/height")[0].text),
            "width": int(tree.findall("./size/width")[0].text),
        }
        instances = []

        for obj in tree.findall("object"):
            cls = obj.find("name").text
            # We include "difficult" samples in training.
            # Based on limited experiments, they don't hurt accuracy.
            # difficult = int(obj.find("difficult").text)
            # if difficult == 1:
            # continue
            bbox = obj.find("bndbox")
            bbox = [float(bbox.find(x).text) for x in ["xmin", "ymin", "xmax", "ymax"]]
            # Original annotations are integers in the range [1, W or H]
            # Assuming they mean 1-based pixel indices (inclusive),
            # a box with annotation (xmin=1, xmax=W) covers the whole image.
            # In coordinate space this is represented by (xmin=0, xmax=W)
            bbox[0] -= 1.0
            bbox[1] -= 1.0
            instances.append(
                {"category_id": CLASS_NAMES.index(cls), "bbox": bbox, "bbox_mode": BoxMode.XYXY_ABS}
            )
        r["annotations"] = instances
        dicts.append(r)
    return dicts

def visualize_dataset(datasetname, n_samples=10):

    dataset_dicts = DatasetCatalog.get(datasetname)
    metadata = MetadataCatalog.get(datasetname)

    for d in random.sample(dataset_dicts,n_samples):
        print(d['file_name'])
        img = cv2.imread(d["file_name"])
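        # Visualizer expects RGB; cv2.imread returns BGR, so reverse the channels.
        visualizer = Visualizer(img[:, :, ::-1], metadata=metadata, scale=0.5)
        vis = visualizer.draw_dataset_dict(d)
        figure(num=None, figsize=(15, 15), dpi=100, facecolor='w', edgecolor='k')
        plt.axis("off")
        # get_image() already returns RGB, which is what matplotlib expects.
        plt.imshow(vis.get_image())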
        plt.show()

        
def register_pascal_voc(name, dirname, split, CLASS_NAMES):
    if name not in DatasetCatalog.list():
        DatasetCatalog.register(name, lambda: load_voc_instances(dirname, split, CLASS_NAMES))
        
    MetadataCatalog.get(name).set(
        thing_classes=CLASS_NAMES, split=split, dirname= dirname, year=2012
    )

# Register the Pascal VOC dataset in detectron2
register_pascal_voc('signature_dataset_train', dirname='datasets', split='train', CLASS_NAMES=["signature","others"])
visualize_dataset('signature_dataset_train', n_samples=4)

datasets/JPEGImages/07653e58-24d1-4b3f-9b4a-76057efe5c09-3.jpg
datasets/JPEGImages/7674b81e-aa42-4891-856d-8938620d6fa0-1.jpg
datasets/JPEGImages/84cce561-1ee5-4201-9dfe-13da1711ca75-2.jpg
datasets/JPEGImages/9854ded9-3bd7-437c-83c4-b05d409c5872-2.jpg

2. Model Training

from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultTrainer
from detectron2.engine import default_setup
from detectron2.config import get_cfg


def setup_cfg(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg


parser = default_argument_parser() 
args = parser.parse_args("--config-file sign_config/sign_faster_rcnn_R_50_FPN_3x.yaml OUTPUT_DIR sign_model ".split())

We copied the config file for Faster RCNN R50 FPN from the model zoo as sign_faster_rcnn_R_50_FPN_3x.yaml and updated its parameters: MODEL.ROI_HEADS.NUM_CLASSES is set to 2, the maximum number of iterations to 4000, and the training dataset name to the one we registered earlier.

The setup_cfg function loads the configuration from the --config-file path and then updates it with the other parameters passed as arguments.

Here, we have passed OUTPUT_DIR to update the value of cfg.OUTPUT_DIR.
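
To see how the argument string splits into a config path and a list of overrides, here is a small standard-library sketch. The parser below is a stand-in built with argparse and is only assumed to mirror the two relevant arguments of default_argument_parser (--config-file and the trailing opts); it is not detectron2's own parser. The trailing tokens form the flat KEY VALUE list that setup_cfg passes to cfg.merge_from_list.

```python
import argparse

# Stand-in parser with only the two arguments that matter for this example.
parser = argparse.ArgumentParser()
parser.add_argument("--config-file", default="")
parser.add_argument("opts", nargs=argparse.REMAINDER, default=None)

arg_string = "--config-file sign_config/sign_faster_rcnn_R_50_FPN_3x.yaml OUTPUT_DIR sign_model "
args = parser.parse_args(arg_string.split())

# opts is a flat list: KEY1, VALUE1, KEY2, VALUE2, ...
overrides = list(zip(args.opts[0::2], args.opts[1::2]))

print(args.config_file)
print(overrides)
```

Because the string is split on whitespace, a value must not contain spaces; a tuple value therefore has to be written without internal spaces.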

cfg = setup_cfg(args)

Now that we have all the configurations, we can start training the model.

As explained earlier, DefaultTrainer builds the model (without weights), the optimizer, and the learning rate scheduler, and then loads the weights from the checkpoint file specified in cfg.MODEL.WEIGHTS.

trainer = DefaultTrainer(cfg) 
trainer.resume_or_load(resume=False)
trainer.train()

3. Model Prediction

The model has now been trained and saved in the output directory. The config saved during training has all the parameters except the path to the trained model weights, so we pass that path as a parameter to load the trained weights.

The DefaultPredictor applies the input image transformations and takes only a single image per call. We can, however, easily modify the DefaultPredictor class to accept a batch of input images for prediction.

from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultPredictor
import config

parser = default_argument_parser()
args = parser.parse_args("--config-file sign_model/config.yaml MODEL.WEIGHTS sign_model/model_final.pth".split())
cfg = config.setup_cfg(args)

predictor = DefaultPredictor(cfg)

import glob
import time
import os 

from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
import cv2

from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog

files = glob.glob("test_images/*.jpg")
sample_size = 5
# Class names are needed so the Visualizer can label the predicted boxes.
metadata = MetadataCatalog.get("signature_dataset_train")
metadata.thing_classes = ["signature", "others"]

for file in files[:sample_size]:
    im = cv2.imread(file)
    start_time = time.time()
    outputs = predictor(im)
    print(time.time() - start_time)

    # Visualizer expects RGB; cv2.imread returns BGR, so reverse the channels.
    v = Visualizer(im[:, :, ::-1], metadata=metadata, scale=0.5)
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    print(file)
    figure(num=None, figsize=(15, 15), dpi=100, facecolor='w', edgecolor='k')
    plt.axis("off")
    # get_image() already returns RGB, which is what matplotlib expects.
    plt.imshow(v.get_image())
    plt.show()

0.4223031997680664
test_images/image_11.jpg
0.18352723121643066
test_images/0d0eddfc-731b-44de-b84d-d265afc7d996-1.jpg
0.17967772483825684
test_images/07653e58-24d1-4b3f-9b4a-76057efe5c09-6.jpg
0.14139199256896973
test_images/0d0eddfc-731b-44de-b84d-d265afc7d996-2.jpg
0.15801262855529785
test_images/image_10.jpg
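
The first call in the log above takes noticeably longer than the rest, which often reflects one-time setup cost rather than steady-state speed (an assumption; the log alone does not show the cause). A minimal sketch for reporting a steadier figure is to discard the first measurement. time_calls and fake_predictor are hypothetical stand-ins so the example runs without a trained model.

```python
import statistics
import time


def time_calls(fn, inputs, warmup=1):
    """Call fn on each input and return per-call seconds, skipping the first `warmup` calls."""
    timings = []
    for i, item in enumerate(inputs):
        start = time.perf_counter()
        fn(item)
        elapsed = time.perf_counter() - start
        if i >= warmup:
            timings.append(elapsed)
    return timings


def fake_predictor(x):
    # Stand-in for predictor(im); replace with the real predictor when available.
    return x * 2


timings = time_calls(fake_predictor, [1, 2, 3, 4, 5])

# Per-image times printed by the loop above, in seconds.
logged = [0.4223031997680664, 0.18352723121643066, 0.17967772483825684,
          0.14139199256896973, 0.15801262855529785]
steady_mean = statistics.mean(logged[1:])  # drop the slower first call
print(round(steady_mean, 4))
```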

4. Evaluate

The DefaultTrainer class does not implement the build_evaluator method. I have created a new Trainer class and added build_evaluator to it. We could have used this new Trainer class in the first step instead of DefaultTrainer, but I wanted to show how easy it is to train the model without writing more code.

from detectron2.engine import default_argument_parser
import config
import trainer
import dataset_utils

dataset_utils.register_pascal_voc('signature_dataset_train', dirname='datasets', split='train', CLASS_NAMES=["signature","others"])
#dataset_utils.register_pascal_voc('signature_dataset_test', dirname='datasets', split='test', CLASS_NAMES=["signature","others"])
                                  
parser = default_argument_parser()
args = parser.parse_args("--config-file sign_model/config.yaml MODEL.WEIGHTS sign_model/model_final.pth".split())
trainer.eval(args)

OrderedDict([('bbox', {'AP': 69.67298193268152, 'AP50': 98.10585383880746, 'AP75': 88.13255308840152})])
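
AP50 and AP75 are the average precision when a detection counts as correct only if its intersection over union (IoU) with a ground-truth box is at least 0.5 and 0.75 respectively. The sketch below shows the IoU computation for boxes in [xmin, ymin, xmax, ymax] form. The ground-truth box is the one from the sample annotation; the predicted box is a made-up value for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = [192, 1188, 738, 1320]  # box from the sample annotation
prediction = [200, 1190, 730, 1330]    # hypothetical detection

iou = box_iou(ground_truth, prediction)
print(round(iou, 3), iou >= 0.5, iou >= 0.75)
```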


Training on a Custom Dataset

Suppose we have more annotated data that is not in Pascal VOC XML format. To train the above model on it, we have to write a custom get_dicts() function that returns the data in detectron2 format.

To improve the accuracy of handwriting detection, I found another dataset of annotated French documents. The annotations are in JSON format, one file per image. The dataset is available in the following GitHub repo: https://github.com/hyperlex/Signature-detection-Practical-guide/tree/master/data/dataset. Download it and save it in the french_dataset directory.

I have added all the functions to the library files dataset_utils.py and trainer.py. We will use these abstractions to quickly train and evaluate new models.

import json 
import glob
import os
import cv2
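from detectron2.structures import BoxMode


def get_french_dicts(annot_dir):
    """Load the per-image JSON annotations into detectron2 format."""
    json_files = glob.glob(os.path.join(annot_dir,'*.json'))
    
    dataset_dicts = []
    
    for f in json_files:
        record={}
        # Use a context manager so the file handle is closed after reading.
        with open(f) as fp:
            img_ann = json.load(fp)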
        
        filename = img_ann['asset']['name']
        height, width = cv2.imread(os.path.join(annot_dir,'..',filename)).shape[:2]
        
        record["file_name"] = os.path.join(annot_dir,'..',filename)
        record["image_id"] = img_ann['asset']['id']
        record["height"] = height
        record["width"] = width
      
        annos = img_ann["regions"]
        objs =[]
        for ann in annos:
            px = ann['boundingBox']['left']
            py = ann['boundingBox']['top']
            px1 = ann['boundingBox']['left'] + ann['boundingBox']['width']
            py1 = ann['boundingBox']['top'] + ann['boundingBox']['height']
            
            obj = {
                "bbox": [px, py, px1, py1],
                "bbox_mode": BoxMode.XYXY_ABS,
                "category_id": {'signature':0,'paraphe':1,'date':1}[ann['tags'][0]],
                "iscrowd": 0
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts

from detectron2.utils.visualizer import Visualizer
from detectron2.data import DatasetCatalog, MetadataCatalog
import dataset_utils 

def get_img_dicts():
    ann1 = dataset_utils.load_voc_instances(dirname = 'datasets', split = 'train', CLASS_NAMES=["signature","others"])
    ann2 = get_french_dicts('french_dataset/per_img_labels')
    return ann1 + ann2

dataset_name = 'signature_dataset_train'
DatasetCatalog.register(dataset_name, get_img_dicts)
# dirname was undefined here; 'datasets' matches the VOC directory used in get_img_dicts
MetadataCatalog.get(dataset_name).set(thing_classes=["signature","others"], split='train', dirname='datasets', year=2012)
Metadata(name='signature_dataset_train', thing_classes=['signature', 'others'])
len(DatasetCatalog.get(dataset_name))
139
dataset_utils.visualize_dataset('signature_dataset_train', n_samples=2)
datasets/JPEGImages/63348ad3-b0cc-45d0-bc85-bf2c865744ec-2.jpg
datasets/JPEGImages/4c85bb9b-1d8d-45b0-8f4a-664d77ee4b83-4.jpg

We can run the remaining steps as we did before for training the model and prediction

from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultTrainer
import trainer

parser = default_argument_parser() 
args = parser.parse_args("--config-file sign_config/chk_faster_rcnn_R_50_FPN_3x.yaml --num-gpus 3 OUTPUT_DIR french_sign_model SOLVER.MAX_ITER 4000".split())
trainer.train(args)

from detectron2.engine import default_argument_parser
import config
import trainer
import dataset_utils

dataset_utils.register_pascal_voc('signature_dataset_test', dirname='datasets', split='train', CLASS_NAMES=["signature","others"])

parser = default_argument_parser()
args = parser.parse_args('--config-file french_sign_model/config.yaml MODEL.WEIGHTS french_sign_model/model_final.pth DATASETS.TEST ("signature_dataset_test",)'.split())
trainer.eval(args)

OrderedDict([('bbox', {'AP': 67.82590895115814, 'AP50': 96.77629667360196, 'AP75': 84.18387912626437})])
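
To compare this run with the earlier one, the two result dictionaries printed above can be subtracted metric by metric. The variable names below are illustrative; the numbers are copied from the two evaluation outputs.

```python
# Metrics from the model trained on the Pascal VOC dataset only.
baseline = {"AP": 69.67298193268152, "AP50": 98.10585383880746, "AP75": 88.13255308840152}
# Metrics from the model trained on the combined dataset.
combined = {"AP": 67.82590895115814, "AP50": 96.77629667360196, "AP75": 84.18387912626437}

# Negative values mean the combined model scored lower on that metric.
deltas = {name: round(combined[name] - baseline[name], 2) for name in baseline}
print(deltas)  # {'AP': -1.85, 'AP50': -1.33, 'AP75': -3.95}
```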

The Average Precision has dropped slightly compared to the previous model. Let us check the prediction results.

from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultPredictor
import config

parser = default_argument_parser()
args = parser.parse_args("--config-file french_sign_model/config.yaml MODEL.WEIGHTS french_sign_model/model_final.pth".split())
cfg = config.setup_cfg(args)

predictor = DefaultPredictor(cfg)

import glob
import time
import os 

from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
import cv2

from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog

files = glob.glob("test_images/*.jpg")
sample_size = 5
# Class names are needed so the Visualizer can label the predicted boxes.
metadata = MetadataCatalog.get("signature_dataset_train")
metadata.thing_classes = ["signature", "others"]

for file in files[:sample_size]:
    im = cv2.imread(file)
    start_time = time.time()
    outputs = predictor(im)
    print(time.time() - start_time)

    # Visualizer expects RGB; cv2.imread returns BGR, so reverse the channels.
    v = Visualizer(im[:, :, ::-1], metadata=metadata, scale=0.5)
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    print(file)
    figure(num=None, figsize=(15, 15), dpi=100, facecolor='w', edgecolor='k')
    # get_image() already returns RGB, which is what matplotlib expects.
    plt.imshow(v.get_image())
    plt.show()

0.4847722053527832
test_images/image_11.jpg
0.18241333961486816
test_images/0d0eddfc-731b-44de-b84d-d265afc7d996-1.jpg
0.17200803756713867
test_images/07653e58-24d1-4b3f-9b4a-76057efe5c09-6.jpg
0.18814849853515625
test_images/0d0eddfc-731b-44de-b84d-d265afc7d996-2.jpg
0.19216012954711914
test_images/image_10.jpg